


Supervised Machine Learning to guess the anonymous authors of the HD Fan Fair 2019

by bafflinghaze



Series: Draco Malfoy/Harry Potter Specific Statistics [3]
Category: Harry Potter - J. K. Rowling
Genre: Essays, Gen, M/M, Machine Learning, Meta, Supervised Learning, statistics, use of academic singular we
Language: English
Status: Completed
Published: 2019-12-15
Updated: 2019-12-15
Packaged: 2021-02-26 05:27:00
Rating: General Audiences
Warnings: No Archive Warnings Apply
Chapters: 1
Words: 1,861
Publisher: archiveofourown.org
Story URL: https://archiveofourown.org/works/21778282
Author URL: https://archiveofourown.org/users/bafflinghaze/pseuds/bafflinghaze
Summary: Anonymous fests are a great way to read new works by authors without prior preconceptions of popularity.A particular feature of the Harry/Draco Fan Fair is the guessing poll, that emerges in the grace period after all works are revealed, but before all contributors are revealed. Guessing podficcers and artists is relatively straightforward. Guessing writers, however, is much more difficult.Here, we use supervised machine learning to guess the identity of the anonymous authors of the Harry/Draco Fan Fair 2019. We find that machine learning can predict up to 60% authors correctly. In contrast, the best guesses in the poll amounted to 18%, hence machine learning gives a marked advantage.
Relationships: Draco Malfoy/Harry Potter
Series: Draco Malfoy/Harry Potter Specific Statistics [3]
Series URL: https://archiveofourown.org/series/1546681
Comments: 14
Kudos: 44





	Supervised Machine Learning to guess the anonymous authors of the HD Fan Fair 2019

**Author's Note:**

> Acknowledgements to [Alpacapricot](https://archiveofourown.org/users/Alpacapricot) ❤️ who was really helpful in telling me about machine learning and language processing 101. Before, I had only done regression on numerical data.

## Introduction

The [Harry/Draco Fan Fair 2019](https://archiveofourown.org/collections/hd_fan_fair) is a long running prompt fest for Draco Malfoy/Harry Potter, focused around specific themes (i.e., unusual careers, pets, books, travel, and food). The works are posted anonymously, and at the conclusion of the fest is a guessing poll that runs for a few days before the final reveal of the authors.

The event includes artworks, written works, and podfics. Podfics are clearly identified by their narrator; With a small enough set of artists, it is typically feasible to judge by visual examination of art style. However, guessing the authors is a much more arduous task.

Hence, we ask: _is it possible to use machine learning to guess?_

The answer: yes! (Or rather, mostly.)

This is a common academic problem of _authorship attribution_. Different people write differently: from the structure of sentences, punctuation, and vocabulary, etc., and by analysing this data, it is possible to guess the author of the work.

Here, we use supervised learning. We are given the list of authors by the event organisers, and this allows us to gather previous writing samples from their AO3 page. The learning is 'supervised' which means we have previous works from each author, and we know who is the author in these previous works. We can use these works as example of each author's writing style (e.g. frequency of popular words, sentence length...). These samples are used to train and build a predictive model, which we then apply to the unknown works from the fest.

This document is organised as follows:

  * Results
  * Discussion
  * Methods
  * References



  
  


##  Results

Here is the word cloud from the fics in the [Harry/Draco Fan Fair 2019](https://archiveofourown.org/tags/H*s*D%20Fan%20Fair%202019/works):

  


This is, in comparison with the word cloud from all the older fics scraped from the authors:

The only need-to-know information to understand the following results:

  * Support Vector Machine (SVM), and Multilayer Perceptron are different machine learning algorithms (different algorithms are suited to different data and predictions)

  * F1-measure is a rating of how good the machine learning model is (based on the _known_ data fed). The bigger the better! Ideally, we would have f1 = 1.

  * the models are guessing on _Bag of Words_ , which is based on how often a word appears in a text. It is a common feature for text-classification. Common “stopwords” are removed, including articles (a, the…) and prepositions (in, on…). However, “Draco Malfoy” isn’t a common word in the general literature, hence names remain in the models. 

  
  


### Now, without further ado:

Best machine-learning-only guess: 27/44 = **61%**

Best machine learning + human intervention: 30/44 = **68%**

Best human-only guessing (from the guessing poll): was [maesterchill](https://archiveofourown.org/users/maesterchill) with 8 fics correct ([milkandhoney](https://archiveofourown.org/users/milkandhoney) also had 8, though they had helped on two of them) = 8/44 = **18%**

and [combined total correct guess from all guessers in the poll](https://hd-fan-fair.livejournal.com/179743.html) is 18/44 = **41%**

  


The complete results are shown in the table below. There is some randomness in the predictions each time it is run, so I picked the ones with the best f1 measure from a handful of runs (very imprecise, I know. Glad this isn’t an official paper!).

Machine learning alone was able to correctly guess _less than half_ of authors. However, some authors are never/rarely predicted ([fanfictionbubbles](https://archiveofourown.org/users/fanfictionbubbles) in particular who had no fics on their AO3 page), and some authors were predicted multiple times (especially [digthewriter](https://archiveofourown.org/users/digthewriter), [gracerene](https://archiveofourown.org/users/gracerene), [AhaMarimbas](https://archiveofourown.org/users/AhaMarimbas), [punk_rock_yuppie](https://archiveofourown.org/users/punk_rock_yuppie), and [ladderofyears](https://archiveofourown.org/users/ladderofyears) who have a lot more fics).

Furthermore, there is extra data that was not built into the machine learning models:

    * In the final predictions, an author can only appear _once_

    * I knew my own fic, **[Infuse With Affection, Enchant With Love](https://archiveofourown.org/works/20652239/chapters/49042691)** ;

    * I could confidently predict the author of **[Harry Potter and the Peahen from Perdition (tigersilver)](https://archiveofourown.org/works/20620829)** due to my familiarity with their distinct humour

    * I strongly believed that **[Baby Gate](https://archiveofourown.org/works/20691434)** was by **[donnarafiki](https://archiveofourown.org/users/donnarafiki)** , due to my familiarity with what they tend to write about, their mentions that they only read the short fics in the fest--and the fact that they didn’t comment on this short fic.

    * Some of the fest authors commented on many works, which means they _couldn't have_ written that work

    * Some authors had partially distinct author notes (i.e. content-wise, tone) that I also used to narrow down guesses

I can combine this extra knowledge with numerous runs of the machine learning in an adhoc way, and doing this increased the correct percentage from 55% to 68%.

*(Note that SVM using 99% of input data to train is technically poor methodology)

  
  


The fics that a human was able to guess are also marked on the results, which is interesting in comparing which fics that were easy to guess both by humans and machine learning—and which were easier for different methods. Authors that were predicted correctly by all methods (human and machine) are:

    * ravenclawsquill
    * ignatiustrout
    * mindabbles
    * bafflinghaze
    * donnarafiki
    * gracerene
    * tigersilver

which suggests that these authors likely have a combination of distinctive style and sufficiently many fics.

  
  


##  Discussion 

In this report, we used machine learning to predict the authors on the anonymous fics of the Harry/Draco Fan Fair 2019. We were able to predict approximately **60%** correctly (24/44), in comparison to the best human prediction of **18%** and best _combined_ human prediction of **41%**. Hence, despite the fact that the machine learning models were not perfect, they were able to significantly increase the number of correct predictions. Hence, using machine learning to predict authors of anonymous fanfic is a viable method.

There are multiple reasons why the machine learning models were imperfect. Aside from authors changing writing styles, and perhaps the existence of better machine learning algorithms, the input data and features are the two main problems

    1. Non-ideal input data

      * Some authors had much more works than others (and some authors had no data at all). Authors with few works are less likely to be selected by the algorithm.

      * As part of fanfiction writing, some words (e.g. character names, places, spells) are much more common. This makes it more difficult to distinguish different authors. This is compounded by the fact that some authors _did not_ have Draco/Harry or even Harry Potter fanfiction in their prior work to draw from. Because of that the algorithm will find that this author's previous works don't look like any of the works in the fest.

    2. Features in the model could be further fine-tuned:

    3. Removing fandom specific words to make fics from authors across different fandoms more equal

    4. Use re-sampling to make the total amount of words per author more balanced.

    5. include other data, such as:

      * total fic lengths

      * fic ratings

      * typically tags used and author note content (such as betas used and casual language)

      * average words per sentence

      * sentence length variation

      * lexical diversity (i.e. range of vocabulary)

      * average word length

      * paragraph lengths

      * commas/question marks/other punctuation per sentence

      * tense used (e.g. past/present/future) and POV (first/second/third)

      * formatting features such as preferred scene breaks, indentation etc.

We did not explore whether authors had more works on fanfiction.net or tumblr. Also, note that the prediction results are not “stable”: in different training iterations, a different random 80% of known data is used as input, for example, and this affects the subsequent predictive model produced.

Now, at the conclusion of the fest, all the authors in the fest now have at least one work. This would make predicting their authorship in future anonymous fests easier.

  
  


##  Methods 

This section is for those interested in more detailed explanations about all the steps of the process.

This section contains:

    * Data gathering
    * Features
    * Training and testing sets
    * [Model baseline](model_baseline)
    * Model training
    * Performance metrics

  


###  Data gathering 

First we need to get the works (fanfics):

    1. From the fest itself,

    2. and the prior works from the known authors.

This was done by scraping AO3 using the [AO3Scraper](https://github.com/radiolarian/AO3Scraper). From the list of authors and artists given, the known artists were removed (they can be determined by looking at their AO3 pages, with the word “[ART]” in titles, and/or zero wordcounts).

The generic works page for any user is https://archiveofourown.org/users/ **username** /works, and the AO3Scraper was used to retrieve all relevant works. Similarly, all the non-podfic, non-art-only works from the Harry/Draco Fan Fair 2019 were retrieved. Through this, I gathered all **metadata and all chapter words**.

  
  


###  Features 

In this report, we are using [scikit-learn](https://scikit-learn.org/stable/), which is a python package.

However, I also recommend [Weka](https://www.cs.waikato.ac.nz/ml/weka/), which has a graphical user interface.

Scikit-learn has a function called word vectorization that turns all the input words into features, i.e. by picking out the most **common words** and calculating the **frequency** of those words in each fic. Essentially, it turns input like this:

  
  


which contains text that cannot be directly used in machine learning, into input like this for example:

  
  


###  Training and testing sets (+unknown data set)

The unknown data set contained 44 fics. The known data set contained 2579 fics, of which some authors featured prominently (note that digthewriter was ultimately an author in the HD Fanfair 2019; however, given that they often both draw and write, I had kept them in).

20% of the input data was randomly selected to be used as the testing set. This is to test the predicting power of the machine learning model on data that we _do_ know.

  


###  Performance metrics 

Percentage of correct guesses is our main performance metric (averaged accuracy).

Meanwhile, the f1 measure (or f1 score) is a combined measure of both _precision_ and _recall_. High precision means that the model would only select answers that it was absolutely sure of, but will miss some results, e.g. for a given author, it would only guess fics it was certain to belong to that author. Meanwhile, with high recall, the model would select as much as possible that could be the answer--including fics by other authors, in our example.

  
  


###  Model baseline 

The model baseline is how much we can predict using very simple strategies, such as assigning every work to the most prolific author. We can only consider our model successful if it beats the results of this baseline.

At minimum, we want the model to be better than simply guessing randomly; and ideally, we want it to exceed human guesses.

If we take the strategy of guessing the most prolific author for all fics (here, digthewriter), then the mean accuracy is 0.18 and a f1-score of 0.05 (using sklearn’s dummy classifier).

As you can see in the results, the accuracy and f1-score ultimately achieved is much greater than this!

Alpacapricot calculated a baseline of 1.44 f-measure using only the frequency of each author with a NaiveBayes classifier on a 80/10 partition of known data.

We can also compare our model to human guess. The best human guesser in the poll obtained 18% accuracy. We also beat this metric with up to 60% accuracy.

  
  


###  Model training 

With the data organised, and testing and training data split, training the model is just one line of code:
    
          machine_learning_model.fit(Input_known_data, output_known_authors)
    

  
  


  
  


##  Some Selected References 

[ Authorship Attribution with Python ](http://www.aicbt.com/authorship-attribution/)

[ A Machine Learning Approach to Author Identification of Horror Novels from Text Snippets ](https://towardsdatascience.com/a-machine-learning-approach-to-author-identification-of-horror-novels-from-text-snippets-3f1ef5dba634)

[Random Forest in Python](https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)

  
  





**Author's Note:**

> Machine learning to predict authors is unfair in comparison to human-only guesses. However, it leads to the possibility of two “competitions”, one of which uses human guesses, and one of which compares data analysis methods.
> 
> The data and code are available on request. 
> 
> Also, formatting this was a nightmare 😅
> 
> Finally, I would like to acknowledge [Alpacapricot](https://archiveofourown.org/users/Alpacapricot/pseuds/Alpacapricot) once again for their help. Any errors remain are my own.


End file.
